Abstract



Discovering frequent graph patterns in a graph database offers valuable information in a variety of applications. 
However, if the graph dataset contains sensitive data of individuals such as mobile phone-call graphs and web-click 
graphs, releasing discovered frequent patterns may present a threat to the privacy of individuals. Differential privacy 
has recently emerged as the de facto standard for private data analysis due to its provable privacy guarantee. In this 
paper we propose the first differentially private algorithm for mining frequent graph patterns. 

We first show that previous techniques for differentially private discovery of frequent itemsets cannot be applied to
mining frequent graph patterns due to the inherent complexity of handling structural information in graphs. We
then address this challenge by proposing a Markov Chain Monte Carlo (MCMC) sampling based algorithm. Unlike
previous work on frequent itemset mining, our techniques do not rely on the output of a non-private mining algorithm.
Instead, we observe that both frequent graph pattern mining and the guarantee of differential privacy can be unified
into an MCMC sampling framework. In addition, we establish the privacy and utility guarantees of our algorithm and
propose an efficient neighboring pattern counting technique. Experimental results show that the proposed
algorithm is able to output frequent patterns with good precision.

1 Introduction 

Frequent graph pattern mining (FPM) is an important topic in data mining research. It has been increasingly applied in 
a variety of application domains such as bioinformatics, cheminformatics and social network analysis. Given a graph 
dataset $\mathcal{D} = \{D_1, D_2, \ldots, D_n\}$, where each $D_i$ is a graph, let $gid(G)$ be the set of IDs of graphs in $\mathcal{D}$ which contain
$G$ as a subgraph. $G$ is a frequent pattern if its count $|gid(G)|$ (also called support) is no less than a user-specified
support threshold $f$. Frequent subgraphs can help the discovery of common substructures, and are the building blocks
of further analysis, including graph classification, clustering and indexing. For instance, discovering frequent patterns
in social interaction graphs can be vital to understanding the functioning of society or the dissemination of diseases.

Meanwhile, publishing frequent graph patterns may pose a potential threat to privacy if the graph dataset contains
sensitive information of individuals. In many applications, identities are associated with individual graphs (rather than 
nodes or edges) which are considered private. For example, the click stream during a browser session of a user is 
typically a sparse subgraph of the underlying web graph; in location-based services, a database may consist of a set of 
trajectories, each of which corresponds to the locations of an individual in a given period of time. Other scenarios of 
frequent pattern mining with sensitive graphs may include mobile phone call graphs [26] and XML representations of
profiles of individuals. Therefore, extra care is needed when mining and releasing frequent patterns in these graphs to 
prevent leakage of private information of individuals. 

It has been well recognized that simple anonymization schemes that only remove obvious identifiers carry serious 
risks to privacy. Even privacy-preserving graph mining techniques (e.g. [20]) based on $k$-anonymity [30] are now often
considered to offer insufficient privacy under strong attack models. Recently, the model of differential privacy [11]
was proposed to restrict the inference of private information even in the presence of a strong adversary. It requires that 
the output of a differentially private algorithm is nearly identical (in a probabilistic sense), whether or not a participant 
contributes her data to the dataset. For the problem of frequent graph mining, it means that even an adversary who is 
able to actively influence the input graphs cannot infer whether a specific pattern exists in a target graph. Although 
tremendous progress has been made in processing flat data (e.g. relational and transactional data) in a differentially
private manner, there has been very little work on differentially private analysis of graph data, due to the inherent
complexity in handling the structural information in graphs.

In this paper we propose the first algorithm for privacy-preserving mining of frequent graph patterns that guarantees
differential privacy. Recently several techniques [5, 21] have been proposed to publish frequent itemsets in a
transactional database in a differentially private manner. It would seem attractive to adapt those techniques to address
the problem of frequent subgraph¹ mining. Unfortunately, compared with private frequent itemset mining, the private
FPM problem poses many more challenges. First, graph datasets do not have a set of well-defined dimensions
(i.e., items), which is required by the techniques in [21]. Second, counting graph patterns is much more difficult than
counting itemsets (due to graph isomorphism), which makes the size of the output space not immediately available in
our problem. This prevents us from applying the techniques in [5]. We will explain the distinction between [5, 21] and
our work in more detail in Section 2.3.

Contributions. The major contributions of this paper are summarized as follows: 

1. For the first time, we introduce a differentially private algorithm for mining frequent patterns in a graph database.
Our algorithm, called Diff-FPM, makes novel use of a Markov Chain Monte Carlo (MCMC) random walk
method to bypass the roadblock of an output space with unknown size. This enables us to apply the exponential
mechanism, which is a general approach to achieving differential privacy. Moreover, unlike [5], which relies on
the output of a non-private itemset mining algorithm, our technique integrates the process of graph mining and
privacy protection as a whole. This is due to the observation that both frequent pattern mining and the application
of the exponential mechanism can be unified into an MCMC sampling framework.

2. Our approach provides provable privacy and utility guarantees on the output of our algorithm. We first show
that our algorithm gives $(\epsilon, \delta)$-differential privacy, which is a relaxed version of $\epsilon$-differential privacy. We
then show that when the random walk has reached its steady state, Diff-FPM gives $\epsilon$-differential privacy. For
utility analysis, because a private frequent graph mining algorithm usually does not output the exact answer,
we quantify the quality of our result by providing a high-probability upper bound on how far the support of the
reported patterns can be from the support threshold specified by the user.

3. The most costly operation in our algorithm is counting the support of a pattern in the graph dataset, due to the
fact that the subgraph isomorphism test is NP-complete. In order to propose a neighboring pattern more efficiently
in MCMC sampling, we develop optimization techniques that significantly reduce the number of invocations of
the subgraph isomorphism test subroutine.

4. We conduct an extensive experimental study on the effectiveness and efficiency of our algorithm. With a moderate
amount of privacy budget, Diff-FPM is able to output private frequent graph patterns with at least 80% precision.

The paper is organized as follows: The basic concepts and techniques for differential privacy, as well as a formal
definition of the FPM problem, are introduced in Section 2. Sections 3 and 4 introduce our Diff-FPM algorithm,
whose privacy and utility analysis is provided in Section 5. The experimental results are presented in Section 6. We review
related work in Section 7, and Section 8 concludes our discussion.



2 Preliminaries 

2.1 Frequent Graph Pattern Mining 

Frequent graph pattern mining (FPM) aims at discovering the subgraphs that frequently appear in a graph dataset. 
Formally, let $\mathcal{D} = \{D_1, D_2, \ldots, D_n\}$ be a sensitive graph database which contains a multiset of graphs. Each graph
$D_i \in \mathcal{D}$ has a unique identifier. Let $G = (V, E)$ be a (sub)graph pattern; the graph identifier set
$gid(G) = \{i : G \subseteq D_i \in \mathcal{D}\}$ includes all IDs of graphs in $\mathcal{D}$ that contain a subgraph isomorphic to $G$. We call $|gid(G)|$ the support of
$G$ in $\mathcal{D}$. The FPM algorithm can be defined either as returning all subgraph patterns whose supports are no less than
a user-specified threshold $f$, or as returning the top $k$ frequent patterns given an integer $k$ as input. One can easily
convert one version to the other.

¹ We use 'graph pattern' and 'subgraph' interchangeably in this paper.

All graphs we consider in this paper are undirected, connected and labeled. Note that each node has a label and 
multiple nodes can have the same label. Depending on the application, the patterns considered may be subject to a set 
$\mathcal{R}$ of rules which are related to domain knowledge or user specifications. It is common to place an upper bound on the
number of nodes and/or edges in the patterns, or specify the set of possible labels. For example, if the graphs represent
chemical compounds, a rule may require the degree of a vertex labeled 'C(arbon)' to be no greater than 4. Another rule
may specify that any output contains at least 5 vertices, in order to filter out some trivial patterns. 

Many non-private algorithms have been proposed for finding frequent subgraphs. The most representative approaches
include the Apriori algorithm [18] and the gSpan [33] algorithm. The Apriori algorithm exploits the observation
that if a graph pattern $G$ is frequent, all its subgraphs must also be frequent. The algorithm works by exploring the
search space, i.e., generating candidate patterns and pruning infrequent ones. The gSpan algorithm maps each graph
to a unique minimum DFS code, which skips the candidate generation process. For a detailed review of graph pattern
mining and other related work, please refer to Section 7.



2.2 Differential Privacy 

Differential privacy [11] is a recent privacy model which provides a strong privacy guarantee. Informally, a data mining
or publishing procedure is differentially private if the outcome is insensitive to any particular record in the dataset. In
the context of graph pattern mining, let $\mathcal{D}, \mathcal{D}'$ be two neighboring datasets, i.e., $\mathcal{D}$ and $\mathcal{D}'$ differ in only one graph,
written as $\|\mathcal{D} - \mathcal{D}'\| = 1$. Let $\mathcal{D}^n$ be the space of graph datasets containing $n$ graphs.

Definition 1 ($\epsilon$-differential privacy). A randomized algorithm $A$ is $\epsilon$-differentially private if for all neighboring
datasets $\mathcal{D}, \mathcal{D}' \in \mathcal{D}^n$, and any set of possible outputs $O \subseteq Range(A)$:

$$\Pr[A(\mathcal{D}) \in O] \le e^{\epsilon} \Pr[A(\mathcal{D}') \in O].$$



The parameter $\epsilon > 0$ allows us to control the level of privacy. A smaller $\epsilon$ places a tighter limit on the
influence of a single graph. Typically, the value of $\epsilon$ should be small ($\epsilon < 1$). $\epsilon$ is usually specified by the data owner
and referred to as the privacy budget. In Section 5.1 our discussion is related to a weaker notion called $(\epsilon, \delta)$-differential
privacy [10], which allows a small additive error factor of $\delta$.

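Definition 2 ($(\epsilon, \delta)$-differential privacy). A randomized algorithm $A$ is $(\epsilon, \delta)$-differentially private if for all neighboring
datasets $\mathcal{D}, \mathcal{D}' \in \mathcal{D}^n$, and any set of possible outputs $O \subseteq Range(A)$:

$$\Pr[A(\mathcal{D}) \in O] \le e^{\epsilon} \Pr[A(\mathcal{D}') \in O] + \delta.$$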



Laplace Mechanism. The most common technique for designing differentially private algorithms is to add random
noise to the true output of a function [11]. The noise is calibrated according to the sensitivity of the function, which is
defined as the maximum difference in the output for any neighboring datasets. Formally,

Definition 3 (Sensitivity). For any function $f : \mathcal{D}^n \to \mathbb{R}$, the sensitivity of $f$ is

$$\Delta f = \max_{\mathcal{D}, \mathcal{D}' : \|\mathcal{D} - \mathcal{D}'\| = 1} |f(\mathcal{D}) - f(\mathcal{D}')|.$$

Given a dataset $\mathcal{D}$ and a numeric function $f$, the Laplace mechanism achieves $\epsilon$-differential privacy by releasing
$\hat{f}(\mathcal{D}) = f(\mathcal{D}) + \mathrm{Lap}(\Delta f/\epsilon)$, where $\mathrm{Lap}(\lambda)$ denotes a random variable drawn from the Laplace distribution with
mean $0$ and variance $2\lambda^2$.
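As a minimal sketch of the Laplace mechanism described above, the following Python snippet adds noise with scale equal to the sensitivity divided by the privacy budget. The function names `laplace_noise` and `laplace_mechanism` are our own illustrative choices, and the noise is generated as the difference of two exponential variates, which is one standard way to obtain a Laplace sample.

```python
import random

def laplace_noise(scale, rng=random):
    """Draw one sample from a zero-mean Laplace distribution with the given scale.

    The difference of two i.i.d. exponential variates with rate 1/scale
    follows Lap(scale), whose variance is 2 * scale**2.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_value, sensitivity, epsilon, rng=random):
    """Release true_value perturbed by Lap(sensitivity / epsilon) noise."""
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    return true_value + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical usage: a support count of 10 has sensitivity 1.
noisy_support = laplace_mechanism(10, sensitivity=1.0, epsilon=0.5)
```

A smaller privacy budget yields a larger noise scale, so the released support is less accurate but reveals less about any single graph.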

Applying the Laplace mechanism requires the output of the function to be numeric. In many applications, however,
the output may be models, classifiers or graphs which contain structural information that is not easily perturbed by
the Laplace mechanism. Thus it cannot be directly applied to the problem of frequent subgraph mining. Still, we can
use this technique to report the frequencies of the patterns we output.
Exponential Mechanism. A more general technique for achieving differential privacy is the exponential mechanism
[24]. It not only supports non-numeric output but also captures the full class of differential privacy mechanisms. The
exponential mechanism considers the whole output space and assumes that each possible output is associated with a
real-valued utility score. By sampling from a distribution in which the probabilities of the desired outputs are exponentially
amplified, the exponential mechanism (approximately) finds the desired outputs while ensuring differential privacy.

Formally, given input space $\mathcal{D}^n$ and output space $\mathcal{X}$, a score function $u : \mathcal{D}^n \times \mathcal{X} \to \mathbb{R}$ assigns each possible output
$x \in \mathcal{X}$ a score $u(\mathcal{D}, x)$ based on the input $\mathcal{D} \in \mathcal{D}^n$. The mechanism then draws a sample from the distribution on $\mathcal{X}$
which assigns each $x$ a probability mass proportional to $\exp(\epsilon u(\mathcal{D}, x)/2\Delta u)$, where
$\Delta u = \max_{\forall x, \mathcal{D}, \mathcal{D}'} |u(\mathcal{D}, x) - u(\mathcal{D}', x)|$ is the sensitivity of the score function. Intuitively, the output with a higher score is exponentially more likely
to be chosen. It is shown that this mechanism satisfies $\epsilon$-differential privacy [24].

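Theorem 1. [24] Given a utility score function $u : \mathcal{D}^n \times \mathcal{X} \to \mathbb{R}$ for a dataset $\mathcal{D}$, the mechanism $A$,

$$A(\mathcal{D}) = \text{return } x \text{ with probability} \propto \exp\left(\frac{\epsilon\, u(\mathcal{D}, x)}{2\Delta u}\right),$$

gives $\epsilon$-differential privacy.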
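To illustrate the exponential mechanism when the output space is small enough to enumerate, the following Python sketch computes the sampling probabilities, each proportional to exp(epsilon * score / (2 * sensitivity)), and draws one output. The function names and the toy scores are our own assumptions; the sketch does not address the case of an output space too large to enumerate.

```python
import math
import random

def exponential_mechanism_probs(scores, epsilon, sensitivity):
    """Map each candidate to its sampling probability.

    scores: dict from candidate output to its utility score.
    The maximum score is subtracted before exponentiation for numerical
    stability; this does not change the normalized probabilities.
    """
    top = max(scores.values())
    weights = {x: math.exp(epsilon * (u - top) / (2.0 * sensitivity))
               for x, u in scores.items()}
    total = sum(weights.values())  # the normalizing constant
    return {x: w / total for x, w in weights.items()}

def exponential_mechanism(scores, epsilon, sensitivity, rng=random):
    """Draw one candidate according to the exponential mechanism."""
    probs = exponential_mechanism_probs(scores, epsilon, sensitivity)
    candidates = list(probs)
    return rng.choices(candidates, weights=[probs[c] for c in candidates], k=1)[0]

# Hypothetical usage: three candidate patterns scored by their supports.
toy_scores = {"A-D": 3, "A-A-D": 2, "A-C-D": 0}
sampled = exponential_mechanism(toy_scores, epsilon=1.0, sensitivity=1.0)
```

Note that computing `total` requires visiting every candidate, which is exactly the step that becomes infeasible when the output space cannot be enumerated.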

The exponential mechanism has been shown to be a powerful technique in finding private medians [8], privately mining
frequent itemsets [5, 21] and more generally adapting a deterministic algorithm to be differentially private
[25]. As discussed in Section 1, it is infeasible to find frequent graph patterns privately using the Laplace mechanism
by adding noise to the support of each possible pattern. Our Diff-FPM algorithm works by carefully applying the
exponential mechanism. In this process we must overcome several critical challenges, which are identified next.



2.3 Challenges 

There has been work [5, 21] on mining frequent itemsets in a transaction dataset under differential privacy. However,
the shift from transactions to graphs poses significant new challenges, which make the previous techniques no longer
suitable for our problem. In [21], transaction datasets are viewed as high-dimensional tabular data, and the proposed
approach projects the input database onto lower dimensions. However, graph datasets do not have a well-defined set of
items, i.e., dimensions, which renders the approach in [21] inapplicable to our FPM problem. In [5], two methods are
proposed which make use of a notion of truncated frequency. However, those methods cannot be used in our problem
due to the following fundamental challenges:

Support Counting. Obtaining the support of a graph pattern is much more difficult than counting itemsets. An itemset 
pattern can be represented by an ordered list or a bitmap of item IDs and does not contain structural information as in 
graphs. Checking the existence of an itemset in a transaction only takes $O(1)$ time (after simple data structures such as
bitmaps have been built), while checking whether a subgraph pattern exists in a graph is NP-complete due to subgraph 
isomorphism. 
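To make the cost of support counting concrete, the following Python sketch checks subgraph containment by brute force over all injective node mappings and then collects the graph identifier set of a pattern. The graph representation, the helper names (`make_graph`, `contains`, `gid`) and the three-graph toy dataset are hypothetical choices for illustration; practical miners rely on far more efficient subgraph isomorphism routines.

```python
from itertools import permutations

def make_graph(labels, edges):
    """Build a labeled undirected graph: (node -> label dict, set of 2-node frozensets)."""
    return dict(labels), {frozenset(e) for e in edges}

def contains(pattern, graph):
    """Return True if graph has a subgraph isomorphic to pattern.

    Brute force: try every injective mapping of pattern nodes to graph nodes
    and check that labels and edges are preserved. Exponential time.
    """
    p_labels, p_edges = pattern
    g_labels, g_edges = graph
    p_nodes = list(p_labels)
    for image in permutations(g_labels, len(p_nodes)):
        m = dict(zip(p_nodes, image))
        labels_ok = all(p_labels[v] == g_labels[m[v]] for v in p_nodes)
        edges_ok = all(frozenset((m[a], m[b])) in g_edges for a, b in p_edges)
        if labels_ok and edges_ok:
            return True
    return False

def gid(pattern, dataset):
    """Return the IDs of all graphs in dataset that contain pattern."""
    return {i for i, g in dataset.items() if contains(pattern, g)}

# Hypothetical toy dataset with three small labeled graphs.
dataset = {
    1: make_graph({1: "A", 2: "A", 3: "D"}, [(1, 2), (2, 3)]),
    2: make_graph({1: "A", 2: "D", 3: "C"}, [(1, 2), (1, 3)]),
    3: make_graph({1: "A", 2: "A", 3: "D", 4: "C"}, [(1, 2), (2, 3), (2, 4)]),
}
pattern_a_d = make_graph({"u": "A", "v": "D"}, [("u", "v")])
pattern_a_a_d = make_graph({"u": "A", "v": "A", "w": "D"}, [("u", "v"), ("v", "w")])
support_a_d = len(gid(pattern_a_d, dataset))
support_a_a_d = len(gid(pattern_a_a_d, dataset))
```

The number of candidate mappings grows factorially with the pattern size, which is why reducing the number of containment tests matters in practice.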

Unknown Output Space. The output space $\mathcal{X}$ in our problem contains a finite number of graph patterns which may
or may not exist in the input dataset. Under differential privacy, any pattern in the output space should have non-zero
probability of being in the final output. The knowledge of the output space is essential in applying the exponential
mechanism, in which we need to sample a pattern $x$ with probability

$$\pi(x) = \frac{\exp(\epsilon u(x)/2\Delta u)}{C}, \qquad (1)$$

where $C = \sum_{x \in \mathcal{X}} \exp(\epsilon u(x)/2\Delta u)$ is the normalizing constant according to Theorem 1. The most straightforward
way to compute $C$ requires enumerating all the patterns in the output space. In [5], a technique is proposed to apply the
exponential mechanism without enumeration if the size of the output space is known. However, unlike [5], in which
the output space size can be obtained by simple combinatorics (i.e., $C_m^l$ patterns of size $l$ given an alphabet of size
$m$), the size of the output space $\mathcal{X}$ in our problem is not immediately available (due to graph isomorphism)², which
prohibits us from applying the exponential mechanism directly. Therefore we cannot apply the same techniques as in [5].

Given the analysis above, we need to develop new ways to overcome the issue of an unknown $|\mathcal{X}|$. Note that
although the global information on the output space is not accessible, we do have the local information on any specific
pattern: given any pattern $x$, we can immediately calculate its utility score $u(x)$ (related to $|gid(x)|$, see Section 3 for
details). In addition, the unknown normalizing constant $C$ is common to all patterns. That is, given any pair of patterns
$x_1, x_2$, the ratio of probability mass $\pi(x_1)/\pi(x_2)$ is available without knowing the exact probabilities, according to
Eq. (1). Such scenarios, where one needs to draw samples from a probability distribution known up to a constant
factor, also arise in statistical physics when analyzing dynamic systems, where Markov Chain Monte Carlo (MCMC)
methods are often used. Inspired by that, our idea is to perform a random walk based on locally computed probabilities.
By carefully choosing the neighbor and the probability of moving in each step using the Metropolis-Hastings (MH)
method [29], the random walk will converge to the target distribution, from which we can output samples. Next we
discuss the details of our Diff-FPM algorithm.



3 Private FPM Algorithm 

3.1 Overview 

The key challenge of handling graph datasets is the unknown output space when applying the exponential mechanism.
The Diff-FPM algorithm meets the challenge by unifying frequent pattern mining and the application of differential privacy into
an MCMC sampling framework. The main idea of Diff-FPM is to simulate a Markov chain by performing an MCMC
random walk in the output space. Our goal is that when the random walk reaches its steady state, the stationary
distribution of the Markov chain matches the target distribution $\pi$ in Eq. (1). In Section 3.2.2 we will explain in detail
how to apply the Metropolis-Hastings (MH) method in our problem to achieve this goal. Before that, we need to define
the state space in which we perform the random walk.

Partial Order Full Graph. To facilitate the MH-based random walk in the output space, we define the Partial Order 
Full Graph (POFG) as the state space of the Markov chain on which the sampling algorithm runs the simulation. Each
node in POFG corresponds to a unique graph pattern and each edge in POFG represents a possible 'extension' (add 
or remove one edge) to a neighboring pattern. Naturally, each node in the POFG has three types of neighbors: sub- 
neighbor (by removing an edge), super-backward neighbor (by connecting two existing nodes) and super-forward 
neighbor (by adding and connecting to a new node). 
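The three neighbor types can be sketched in Python as follows. This is a simplified illustration under our own assumptions: a pattern is a pair of a node-to-label dictionary (with integer node IDs) and a set of two-node frozensets, patterns have at least one edge, and isomorphic duplicates among the generated neighbors are not merged. All function names are hypothetical.

```python
from itertools import combinations

def is_connected(nodes, edges):
    """Check connectivity of an undirected graph with a depth-first search."""
    nodes = set(nodes)
    if not nodes:
        return False
    adj = {v: set() for v in nodes}
    for e in edges:
        a, b = tuple(e)
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(adj[v] - seen)
    return seen == nodes

def sub_neighbors(labels, edges):
    """Remove one edge, drop a node left isolated, keep only connected results."""
    out = []
    for e in edges:
        rest = edges - {e}
        nodes = {v for edge in rest for v in edge}
        if rest and is_connected(nodes, rest):
            out.append(({v: labels[v] for v in nodes}, rest))
    return out

def super_backward_neighbors(labels, edges):
    """Connect two existing nodes that are not yet adjacent."""
    out = []
    for a, b in combinations(sorted(labels), 2):
        e = frozenset((a, b))
        if e not in edges:
            out.append((dict(labels), edges | {e}))
    return out

def super_forward_neighbors(labels, edges, alphabet):
    """Attach one new node, with any label from alphabet, to an existing node."""
    new = max(labels) + 1
    out = []
    for v in labels:
        for lab in alphabet:
            new_labels = dict(labels)
            new_labels[new] = lab
            out.append((new_labels, edges | {frozenset((v, new))}))
    return out

# Hypothetical usage: the path pattern A-A-C with label alphabet {A, C, D}.
labels = {0: "A", 1: "A", 2: "C"}
edges = {frozenset((0, 1)), frozenset((1, 2))}
subs = sub_neighbors(labels, edges)
backs = super_backward_neighbors(labels, edges)
fwds = super_forward_neighbors(labels, edges, ["A", "C", "D"])
```

Because every neighbor is generated from the current pattern alone, a walk over the POFG never needs the full state space in memory.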

Example 1. Figure 1 shows a simple graph dataset containing 3 graphs and its POFG. The dashed patterns have
support smaller than 2 in the dataset. Pattern 'A-A-C' has two sub-neighbors, one super-backward neighbor and
several super-forward neighbors (only one shown in Figure 1(b)). Self-loops and multi-edges are not considered in
this example and thus are excluded from the output space.

At a higher level, the random walk starts with an arbitrary pattern and proceeds to an adjacent pattern with certain 
probability in each step. Since the transition decision is made solely based on local information, there is no need to 
construct the global POFG explicitly. When the random walk has reached its steady state, the probability of being in 
state $x$ follows exactly the target distribution $\pi(x)$ in Eq. (1). Then the current state is drawn as a sampled pattern.
Since the frequent patterns have larger probabilities in the target distribution, they are more likely to appear in the final 
output. 

Before introducing the details, we need to make sure that the random walk on POFG we design indeed converges 
to a stationary distribution. A random walk needs to be finite, irreducible, and aperiodic to converge to a stationary 
distribution [29]. The analysis is similar to that in [3].

² Essentially, we need to answer the question 'Is there a closed-form formula or polynomial-time algorithm to count the number of graphs
given the number of vertices, edges and a set of possible labels?'. (1) If the vertex labels are all unique, we know the number of graphs given $n$
vertices is $2^{n(n-1)/2}$. (2) If the graph is unlabeled, the problem is considerably harder due to graph isomorphism. The Pólya enumeration theorem
provides an algorithm to compute the number of isomorphism classes of graphs with $n$ vertices and $m$ edges [28]. But it gives neither a formula
nor a generating function. (3) When the labels are not unique, the problem is at least as hard as the unlabeled case.
[Figure 1: Example graph database and POFG. (a) Graph database with 3 graphs. (b) Part of the POFG of Figure 1(a).]

3.2 Detailed Descriptions 

3.2.1 Backgrounds on Markov Chain 

A Markov chain is a discrete-time stochastic process defined over a set of states $\mathcal{X}$. $\mathcal{X}$ can be finite or countably
infinite. The Markov property requires that given the present state, the past and the future are independent. The
stochastic process is characterized by the transition matrix $P$, which defines the probability of transition between any
two states in $\mathcal{X}$, i.e., $P(x, y)$ is the probability that the next state will be $y$, given that the current state is $x$. For all
$x, y \in \mathcal{X}$, we have $0 \le P(x, y) \le 1$, and $\sum_y P(x, y) = 1$, i.e., $P$ is row-stochastic.

A stationary distribution of a Markov chain with transition probability $P$ is a probability distribution $\pi$ (a row
vector of size $|\mathcal{X}|$), such that $\pi = \pi P$.

If a Markov chain is finite, irreducible and aperiodic, regardless of where it begins, the chain will converge to the
stationary distribution. We also say it has reached the steady state when the chain has converged.

If the state space $\mathcal{X}$ of a Markov chain is the vertex set $\mathcal{V}$ of a graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, and if for any $u, v \in \mathcal{V}$, $(u, v) \notin \mathcal{E}$
implies $P(u, v) = 0$, then the process is also called a random walk on the graph $\mathcal{G}$. In other words, transitions only
occur between adjacent nodes.
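As a small numerical illustration of convergence to the stationary distribution, the Python sketch below repeatedly multiplies a distribution by a row-stochastic transition matrix. The three-state matrix is a hypothetical example of a lazy random walk on a path graph, and the function names are our own.

```python
def step(dist, P):
    """One step of the chain: return the row vector dist multiplied by P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def stationary(P, start, iters=200):
    """Approximate the stationary distribution by iterating dist <- dist P."""
    dist = list(start)
    for _ in range(iters):
        dist = step(dist, P)
    return dist

# Hypothetical lazy random walk on the path 0 - 1 - 2 (finite, irreducible, aperiodic).
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
from_left = stationary(P, [1.0, 0.0, 0.0])
from_right = stationary(P, [0.0, 0.0, 1.0])
```

Both starting points lead to the same limit, approximately (0.25, 0.5, 0.25), which satisfies the stationarity condition for this matrix.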

3.2.2 Applying the MH method 

The MH method is a Markov Chain Monte Carlo (MCMC) method for obtaining a sequence of random samples from 
a target probability distribution for which direct sampling is difficult. It only requires that a function proportional to 
the probability mass be calculable. The main idea of the MH method is to simulate a Markov chain such that the 
stationary distribution of the chain matches the target distribution [29].

Suppose we want to generate a random variable $X$ taking values in $\mathcal{X} = \{x_1, \ldots, x_{|\mathcal{X}|}\}$ according to a target
distribution $\pi$, with

$$\pi(x_i) = \frac{b(x_i)}{C}, \quad x_i \in \mathcal{X},$$

where all $b(x_i)$ are strictly positive, $|\mathcal{X}|$ is large, and the normalizing constant $C = \sum_{i=1}^{|\mathcal{X}|} b(x_i)$ is difficult to calculate.

The MH method first constructs an $|\mathcal{X}|$-state Markov chain $\{X_t, t = 0, 1, \ldots\}$ on $\mathcal{X}$ whose evolution relies on an
arbitrary proposal transition matrix $Q = (q(x, y))$ in the following way:

• When $X_t = x$, generate a random variable $Y$ satisfying $P(Y = y) = q(x, y)$, $y \in \mathcal{X}$.

• If $Y = y$, let

$$X_{t+1} = \begin{cases} y & \text{with probability } \alpha_{xy}, \\ x & \text{with probability } 1 - \alpha_{xy}, \end{cases}$$

where $\alpha_{xy} = \min\left\{\frac{\pi(y)\, q(y, x)}{\pi(x)\, q(x, y)}, 1\right\} = \min\left\{\frac{b(y)\, q(y, x)}{b(x)\, q(x, y)}, 1\right\}$.

This means that given a current state $x$, the next state is
proposed according to the proposal distribution $Q$. $q(x, y)$ is the probability mass of state $y$ among all possible states
given that the current state is $x$. With probability $\alpha_{xy}$, the proposal is accepted and the chain moves to the new state $y$.
Otherwise it remains at state $x$. It follows that $\{X_t, t = 0, 1, \ldots\}$ has a one-step transition probability matrix $P$:

$$P(x, y) = \begin{cases} q(x, y)\, \alpha_{xy} & \text{if } x \ne y, \\ 1 - \sum_{z \ne x} q(x, z)\, \alpha_{xz} & \text{if } x = y. \end{cases}$$

It can be shown that for the above $P$, the Markov chain is reversible and has a stationary distribution $\pi$, equal to
the target distribution. Therefore, once the chain has reached the steady state, the sequence of samples we get from the
MH method should follow the target distribution. Next we use an example to explain how the state transition works in
our Diff-FPM algorithm.



Example 2. Consider a random walk on the POFG illustrated in Figure 1(b). Suppose the current state of the
walk is 'A-A-D' (pattern $x$). Following the MH method, one of pattern $x$'s neighbors needs to be proposed according
to a proposal distribution $q(x, y)$. For simplicity, in this example each neighbor has an equal probability of being
proposed, i.e., $q(x, y) = 1/|N(x)|$, where $N(x)$ is the neighbor set of $x$. Assuming 'A-D' (pattern $y$) is proposed
and $|N(x)| = 5$, $|N(y)| = 10$, $b(\cdot) = \exp(|gid(\cdot)|/2)$, the probability of accepting the proposal is calculated as

$$\alpha_{xy} = \min\left\{\frac{\exp(3/2) \cdot (1/10)}{\exp(2/2) \cdot (1/5)}, 1\right\} = 0.82.$$

We can then draw a random number between 0 and 1 to decide whether
to walk to pattern $y$ or stay at $x$.

The ability to generate a sample without knowing the normalizing constant of proportionality is a major virtue of 
the MH method. This salient feature fits perfectly the scenario when direct application of the exponential mechanism 
is formidable due to an unmanageable output space. 
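The following Python sketch shows a Metropolis-Hastings random walk that uses only unnormalized weights, so the normalizing constant is never computed. It assumes uniform proposals over the neighbors of the current state, and the four-state path graph with its weights is a hypothetical toy example; the function names are ours. The first usage line evaluates the acceptance probability for a move from a state with weight exp(2/2) and 5 neighbors to a state with weight exp(3/2) and 10 neighbors.

```python
import math
import random

def acceptance(b_x, b_y, q_xy, q_yx):
    """MH acceptance probability for moving from x to y.

    b_x, b_y: unnormalized target weights; q_xy, q_yx: proposal probabilities.
    """
    return min(1.0, (b_y * q_yx) / (b_x * q_xy))

def mh_walk(start, neighbors, b, steps, rng=random):
    """Run an MH random walk with uniform proposals and return the visited states.

    neighbors: dict from state to a list of adjacent states.
    b: dict from state to its unnormalized target weight.
    """
    x = start
    visited = []
    for _ in range(steps):
        y = rng.choice(neighbors[x])                      # propose a neighbor
        a = acceptance(b[x], b[y],
                       1.0 / len(neighbors[x]), 1.0 / len(neighbors[y]))
        if rng.random() < a:                              # accept or stay put
            x = y
        visited.append(x)
    return visited

# Acceptance probability for the move described in the lead-in.
alpha = acceptance(math.exp(2 / 2), math.exp(3 / 2), 1 / 5, 1 / 10)

# Hypothetical toy chain: a path of four states with weights 1, 2, 4, 8.
toy_neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
toy_weights = {0: 1.0, 1: 2.0, 2: 4.0, 3: 8.0}
walk = mh_walk(0, toy_neighbors, toy_weights, 200000, random.Random(1))
freq = {s: walk.count(s) / len(walk) for s in toy_weights}
```

After enough steps the visit frequencies in `freq` approach the normalized weights (here 1/15, 2/15, 4/15 and 8/15), even though the code never sums the weights.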

The description of the Diff-FPM algorithm above is summarized in Algorithm 1. The input consists of the raw
graph dataset $\mathcal{D}$, a support threshold $f$ and the privacy budget $\epsilon = \epsilon_1 + \epsilon_2$. If the top-$k$ frequent patterns are desired,
we first run a non-private FPM algorithm such as gSpan [33] to get the support threshold $f$, i.e., the support of the $k$th
frequent pattern. If one only needs $k$ patterns whose supports are no less than a threshold, $f$ can be directly provided
to the algorithm. At a high level, Algorithm 1 consists of two phases: sampling and perturbation. The sampling
phase includes $k$ applications of the exponential mechanism via MH-based random walk in the output space.

Initially, we select an arbitrary pattern in the output space to start the walk (Line 2). At each step, we propose a
neighboring pattern $y$ of the current pattern $x$ according to a proposal distribution (Line 4). The proposal distribution
does not affect the correctness of the MH method, so we defer the details to Section 3.2.4 (it does affect the speed
of convergence though). The proposed pattern is then accepted with probability $\alpha_{xy}$ as in the MH algorithm (Line
5), where $u(\cdot)$ is the score function with $\Delta u$ being the sensitivity of $u(\cdot)$. We explore the design space of the score
function in Section 3.2.3. When the Markov chain has converged (see Section 3.3 for convergence diagnostics),
we output the current pattern and remove it from the output space (Lines 6 to 8). We then start a new walk until $k$
patterns have been sampled. Finally, if one wants to include the support of each output pattern as well, the count of
each pattern is perturbed by adding $\mathrm{Lap}(k/\epsilon_2)$ noise (Line 9).
Algorithm 1: Diff-FPM algorithm

input : Graph dataset $\mathcal{D}$, support threshold $f$, privacy budget $\epsilon_1$, $\epsilon_2$
output: A set $S$ of $k$ private frequent patterns with noisy supports

1 for i = 1 to k do
2     Choose any pattern in the output space as the initial pattern;
3     while True do
4         Propose a neighboring pattern $y$ of the current pattern $x$ according to the proposal distribution (Eq. (2));
5         Accept the proposed pattern with probability $\alpha_{xy} = \min\left\{\frac{\exp(\epsilon_1 u(y)/2k\Delta u)\, q(y, x)}{\exp(\epsilon_1 u(x)/2k\Delta u)\, q(x, y)}, 1\right\}$;
6         if convergence conditions are met then
7             Add current pattern to $S$ and remove it from the output space;
8             break;
9 (Optional) for each pattern in $S$, perturb its true support by the Laplace mechanism with privacy budget $\epsilon_2/k$;



3.2.3 Score Function Design 

Choosing the utility score function is vital in our approach as it directly affects the target distribution. A general guideline
is that the patterns with higher supports should have higher utility scores in order to have larger probabilities of
being chosen according to the exponential mechanism. Under this guideline, given an input database $\mathcal{D}$, the most straightforward
choice is to let $u(x, \mathcal{D}) = |gid(x)|$ for any pattern $x$. In this case, the sensitivity $\Delta u$ is exactly 1 since the support
of any subgraph pattern may vary by at most 1 with the addition or removal of a graph in the dataset. Other choices
include assigning the same utility scores to all patterns having supports no less than $f$, or deliberately lowering the
scores of the infrequent patterns. For example, let $u(x) = a(|gid(x)| - b)$ if $|gid(x)| < f$, where $0 < a < 1$, $b > 0$,
and $u(x) = |gid(x)|$ if $|gid(x)| \ge f$. In this case, the infrequent patterns have even less probability of being sampled.
However, this will also increase $\Delta u$ and thus deteriorate the utility, according to Theorem 1. We will further study the
impact of various score functions in the experiment section.
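The trade-off between penalizing infrequent patterns and increasing the sensitivity can be checked numerically. The Python sketch below implements the piecewise score as a function of the support and estimates the sensitivity as the largest change in score when the support changes by one. The parameter values (a = 0.5, b = 1, f = 5) and the function names are hypothetical choices for illustration.

```python
def utility(support, f, a=0.5, b=1.0):
    """Piecewise score: the support itself if frequent, a * (support - b) otherwise."""
    if support >= f:
        return float(support)
    return a * (support - b)

def sensitivity(score, n):
    """Largest change in score when the support moves by one, over supports 0..n."""
    return max(abs(score(s + 1) - score(s)) for s in range(n))

# Plain support as the score: adding or removing one graph changes it by at most 1.
plain = sensitivity(lambda s: float(s), 10)

# Penalized score with threshold f = 5: the jump at the threshold dominates.
penalized = sensitivity(lambda s: utility(s, f=5), 10)
```

With these values the plain score has sensitivity 1, while the penalized score has sensitivity 3.5 because of the jump from 1.5 to 5 when the support crosses the threshold.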

3.2.4 Proposal Distribution 

Although in theory the proposal distribution can be arbitrary, it can substantially impact the efficiency of the MH method
by affecting the mixing time (time to reach steady state). A good proposal distribution can improve the convergence
speed by increasing the acceptance rate $\alpha_{xy}$ in the MH method. On the contrary, if the proposed pattern is often rejected, the
chain can hardly move forward. It has been suggested that one should choose a proposal distribution close to the target
distribution [15]. In our problem setting, it is preferable to make a distinction between the patterns having support
no less than $f$ (referred to as frequent patterns) and those whose supports are lower (referred to as infrequent patterns).
Given a current state $x$, we denote the set of frequent neighbors of $x$ as $N_1(x)$ and the set of infrequent neighbors
as $N_2(x)$. Since $|N_2(x)|$ is usually larger than $|N_1(x)|$, we balance the probability mass assigned to $N_1(x)$ and
$N_2(x)$ by introducing a tunable parameter $\eta$. For the same reason, we use $\rho$ to control the bias toward either the
sub-neighbors $N_1^{sub}(x)$ or the super-neighbors $N_1^{sup}(x)$ within the desired set $N_1(x)$. Our heuristic-based proposal distribution
is formally described below:

$$Q(x, y) = \begin{cases} \rho\, \eta \times \frac{1}{|N_1^{sub}(x)|} & \text{if } y \in N_1^{sub}(x), \\ (1 - \rho)\, \eta \times \frac{1}{|N_1^{sup}(x)|} & \text{if } y \in N_1^{sup}(x), \\ (1 - \eta) \times \frac{1}{|N_2(x)|} & \text{if } y \in N_2(x). \end{cases} \qquad (2)$$

The best values of $\eta$ and $\rho$ can be tuned experimentally. If any of the three sets of neighbors in Eq. (2) is empty, its
probability mass will be redistributed (by setting $\rho = 0$, $\rho = 1$ and $\eta = 1$ respectively).
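A possible reading of this proposal distribution is sketched below in Python. It assumes that a fraction eta of the mass goes to frequent neighbors and the rest to infrequent ones, and that rho splits the frequent mass between sub-neighbors and super-neighbors, with mass spread uniformly inside each set. The function name is ours, and the handling of the case where both frequent sets are empty (setting eta to 0) is our own assumption, since the text does not cover it.

```python
def proposal(sub_freq, super_freq, infreq, eta, rho):
    """Return a dict mapping each neighbor to its proposal probability.

    sub_freq:   frequent sub-neighbors of the current pattern
    super_freq: frequent super-neighbors of the current pattern
    infreq:     infrequent neighbors of the current pattern
    """
    if not (sub_freq or super_freq or infreq):
        raise ValueError("the current pattern has no neighbors")
    if not infreq:
        eta = 1.0                     # no infrequent neighbors: all mass to frequent ones
    if not sub_freq and not super_freq:
        eta = 0.0                     # assumption: no frequent neighbors at all
    elif not sub_freq:
        rho = 0.0                     # no frequent sub-neighbors
    elif not super_freq:
        rho = 1.0                     # no frequent super-neighbors
    q = {}
    for y in sub_freq:
        q[y] = rho * eta / len(sub_freq)
    for y in super_freq:
        q[y] = (1.0 - rho) * eta / len(super_freq)
    for y in infreq:
        q[y] = (1.0 - eta) / len(infreq)
    return q

# Hypothetical usage with placeholder neighbor names.
q = proposal(["s1"], ["p1", "p2"], ["i1", "i2", "i3", "i4"], eta=0.8, rho=0.5)
```

In this usage example the single frequent sub-neighbor receives 0.4, each frequent super-neighbor 0.2, and each infrequent neighbor 0.05, so the probabilities sum to one.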

3.2.5 Pattern Removal 

In Lines 6 to 8 of Algorithm 1, after the convergence conditions are met and a sample pattern $g$ is output, we need
to exclude $g$ from the output space by connecting $g$'s neighbors and removing $g$ in the POFG. In our implementation
this is done by replacing $g$ by all the neighbors of $g$ whenever $g$ appears in some pattern's neighborhood. Note that we
do not output multiple patterns when the chain has converged. This is because once a pattern is sampled, it should be
excluded from the output space and thus have zero probability of being chosen again. Therefore adjustment to the output space
is necessary after each sample. For the same reason we do not run multiple chains at once.

3.3 Convergence Diagnostics 

The theory of MCMC sampling requires that samples are drawn when the Markov chain has converged to the stationary
distribution, which is also our target distribution $\pi$. The most straightforward way to diagnose convergence is to
monitor the distance between the target distribution $\pi$ and the distribution of samples $\hat{\pi}$. In practice, however, $\pi$ is
often known only up to a constant factor. To deal with this problem, several online diagnostic tests have been developed
in the MCMC literature [15] and used in random walk based sampling on graphs [16].

Online diagnostics rely on detecting whether the chain has lost its dependence on the starting point. In particular,
two standard convergence tests, the Geweke diagnostic iTPfl and the Gelman and Rubin diagnostic [13], are commonly used,
which are based on analysis of intra-chain and inter-chain properties respectively. Since our problem setting does not
support running multiple chains at the same time, we will focus on the Geweke diagnostic.

The Geweke diagnostic takes two non-overlapping parts (usually the first 0.1 and last 0.5 proportions) of the
Markov chain and compares the means of both parts to see if they are from the same distribution. Specifically, let $X$
be a sequence of samples of our metric of interest and $X_1, X_2$ be the two non-overlapping subsequences. Geweke
computes the Z-score:

$$Z = \frac{E(X_1) - E(X_2)}{\sqrt{\mathrm{Var}(X_1) + \mathrm{Var}(X_2)}}$$

With an increasing number of iterations, $X_1$ and $X_2$ should move further apart and become less and less correlated.
When the chain has converged, $X_1$ and $X_2$ should be identically distributed with $Z \sim N(0, 1)$ by the law of large
numbers. We can declare convergence when $Z$ has continuously fallen in the $[-1, 1]$ range. Since the samples in our
problem are graph patterns rather than a scalar, we may need to monitor multiple scalar metrics related to different
properties of the sampled pattern and declare convergence when all these metrics have converged.
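
A minimal sketch of this Z-score for one scalar metric is given below. It follows the formula exactly as written above; standard implementations of the Geweke test instead use the variance of the two segment means (often estimated from spectral densities), so this should be read as an illustration rather than a reference implementation. The function name `geweke_z` and its defaults are assumptions.

```python
from math import sqrt
from statistics import mean, variance


def geweke_z(samples, first=0.1, last=0.5):
    """Z-score comparing the first and last segments of a scalar chain."""
    n = len(samples)
    head = samples[: int(first * n)]
    tail = samples[n - int(last * n):]
    if len(head) < 2 or len(tail) < 2:
        raise ValueError("need at least two samples in each segment")
    diff = mean(head) - mean(tail)
    denom = sqrt(variance(head) + variance(tail))
    if denom == 0.0:
        # Both segments are constant: identical means give 0, otherwise diverge.
        return 0.0 if diff == 0 else float("inf") * (1 if diff > 0 else -1)
    return diff / denom
```

For graph patterns, one such Z-score would be tracked per monitored metric (for example the number of nodes of the current pattern at each iteration).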

We need to acknowledge that these convergence diagnostic tools from the MCMC literature are heuristic per se.
Verifying the convergence remains an open problem if the distribution of samples is not directly observable. Even
so, Diff-FPM still achieves $(\epsilon, \delta)$-differential privacy if there exists a small distance between the target and simulation
distributions, as we will show in Lemma 2 in Section 5.



4 Efficient Exploration of Neighbors (EEN) 

We have discussed so far the core of the Diff-FPM algorithm, and seemingly it could be run straightforwardly. However,
without certain optimizations, the computation cost might render the algorithm impractical. The most
costly operation in the Diff-FPM algorithm is proposing a neighbor of the current pattern $x$. According to the proposal
distribution in Eq. 2, this requires knowledge of the support of each pattern in $x$'s neighborhood. Because the
subgraph isomorphism test is NP-complete, obtaining the support of each neighbor might become a computation
bottleneck. To overcome this problem, we have developed an efficient algorithm (called EEN), which aims at minimizing
the number of invocations of the subgraph isomorphism test subroutine. Experimental results in Section 6 show that
the time cost per iteration can be reduced by up to an order of magnitude using this optimization.

4.1 Problem Formulation 

In order to propose a neighbor $y$ of a pattern $x$ according to the proposal distribution, we need to investigate the
neighbor set $N(x)$ of $x$ and test the frequentness of each neighbor $y \in N(x)$. The task of neighbor exploration can
be described as: given a pattern $x$, find the set of frequent sub-neighbors $N_1^b(x)$, frequent super-neighbors $N_1^p(x)$ and
infrequent neighbors $N_2(x)$, as introduced in the proposal distribution (see Eq. 2).

The neighbor set $N(x)$ is composed of two parts: super-neighbors $N^p(x)$ and sub-neighbors $N^b(x)$. A pattern $y$
is a super-neighbor of $x$ if $y = x \diamond e$ and $x \subseteq y$ (we use $\subseteq$ to denote the subgraph relationship), where $e$ is a
new edge and $\diamond$ is an extension operation. If $e$ connects two existing nodes in $x$, it is called a back edge. Otherwise, a
new node is created with a random label from a label set $L$ and then connected to an existing node in $x$. In this case the
new edge is called a forward edge. Thus $N^p(x) = N^p_{back}(x) \cup N^p_{fwd}(x)$, where $N^p_{back}(x)$ and $N^p_{fwd}(x)$ are the sets of
super-back and super-forward neighbors of $x$ respectively.

Similarly, pattern $y$ is a sub-neighbor of $x$ if $x = y \diamond e$ and $y \subseteq x$. There are two types of edge removals as
well. Back edge removal removes an edge and keeps the remaining pattern connected with no vertex removed, while
forward edge removal isolates exactly one vertex which is also removed from the resulting pattern. The above
neighbor generation process ensures the random walk is reversible (which is sufficient for the chain to have a stationary
distribution), i.e., for any neighboring patterns $x$ and $y$, if there is a walk from $x$ to $y$, $y$ can also walk to $x$ and vice
versa.

4.2 The EEN Algorithm 

A naive way to populate $N_1^b(x)$, $N_1^p(x)$ and $N_2(x)$ is to test each neighbor of $x$ against the graph dataset $\mathcal{D}$. However,
this is extremely inefficient since $|N(x)| \cdot |\mathcal{D}|$ isomorphism tests are required, where $|\mathcal{D}|$ is the number of graphs in
$\mathcal{D}$. A simple optimization would be using the monotonic property of frequent patterns: if $x$ is a frequent pattern, any
subgraph of $x$ should be frequent too; likewise, an infrequent pattern's super-graph must be infrequent. However, the
naive method is still required for exploring $N^p(x)$ if $x$ is frequent or $N^b(x)$ if $x$ is infrequent.

The EEN algorithm is able to further reduce the number of isomorphism tests. Observing that $x$ and $y$ only differ
in one edge for all $y \in N(x)$, the main idea is to re-use the isomorphic mappings between $x$ and $D_i \in \mathcal{D}$ and examine
whether any of the isomorphic mappings can be retained after extending an edge. The EEN algorithm is formally
presented in Algorithm 2 and is described in the following.

Algorithm 2 takes pattern $x$, graph dataset $\mathcal{D}$ and support threshold $f$ as input and returns $N_1^b(x)$, $N_1^p(x)$ and
$N_2(x)$. First, pattern $x$ is tested against each graph in $\mathcal{D}$ and the result is stored in $B_x = \{i \mid x \subseteq D_i, D_i \in \mathcal{D}\}$, which
is the set of IDs of graphs containing pattern $x$ (line 2). The subgraph isomorphism algorithm we use is the VF2
algorithm [7]. Next we populate three types of neighbors of $x$: sub-neighbors $N^b$, super-back neighbors $N^p_{back}$ and
super-forward neighbors $N^p_{fwd}$ (line 3), and handle them differently.

Explore sub-neighbors (line 4 to 7). For $N^b$, if $x$ is frequent, the entire set $N^b$ should be frequent. If $x$ is infrequent,
each pattern in $N^b$ is examined by the boolean sub-procedure SUB_IS_FREQ (line 40 to 44). SUB_IS_FREQ takes a
sub-neighbor $x'$ of $x$ and $B_x$ as input and returns the frequentness of $x'$. First we find $B_E = \bigcap_{e \in x'} B_e$, the intersection
of ID sets of all edges in pattern $x'$. Then the subgraph isomorphism test is only needed for the graphs $D_i$ with $i \in B_E \setminus B_x$.
The set $C$ of IDs of graphs that pass the test, together with $B_x$, comprises $B_{x'}$. Finally the procedure returns the
frequentness of $x'$ by comparing $f$ and the size of $B_{x'}$.

Explore super-back neighbors (line 8 to 22). For $N^p_{back}$, if $x$ is infrequent, the entire $N^p_{back}$ must be infrequent.
Otherwise, we test whether $x' \in N^p_{back}$ is a subgraph of $D_i$ for each $D_i$. In this part, the EEN algorithm does not
require any additional subgraph isomorphism test at all. This is achieved by re-using the isomorphism mappings
between the base pattern $x$ and $D_i$ and reasoning upon that. In line 12 we find the subgraph isomorphism mappings
$\mathcal{M}: V_x \to V_{D_i}$, which can be obtained at the same time when computing $B_x$ in line 2. Suppose $x$ is extended to $x'$ by
connecting nodes $u$ and $v$ (line 15). If any of the isomorphism mappings $m \in \mathcal{M}$ is preserved with the edge extension
(i.e., $m(u)$ and $m(v)$ are adjacent in $D_i$), then $x'$ must be a subgraph of $D_i$. Otherwise, if none of the mappings can be
preserved, $x'$ is not a subgraph of $D_i$.
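
The back-edge check can be illustrated as follows. This is a sketch under assumptions, not the authors' code: each data graph is a dictionary from node to the set of its neighbors, each stored mapping is a dictionary from pattern node to data-graph node, edge labels are ignored, and the names `back_edge_supported` and `count_back_support` are hypothetical. The test is only conclusive when the list holds all mappings of $x$ into $D_i$.

```python
def back_edge_supported(mappings, adjacency, u, v):
    """True if some stored mapping of x into D_i also maps the new edge (u, v)."""
    return any(m[v] in adjacency[m[u]] for m in mappings)


def count_back_support(per_graph, u, v, f):
    """Count graphs containing x extended by (u, v), with early stopping.

    per_graph holds one (mappings, adjacency) pair per data graph D_i.
    """
    hits = 0
    total = len(per_graph)
    for i, (mappings, adjacency) in enumerate(per_graph):
        remaining = total - i  # graphs not yet examined, including this one
        if hits >= f or hits + remaining < f:
            break  # the threshold is already reached or can no longer be reached
        if back_edge_supported(mappings, adjacency, u, v):
            hits += 1
    return hits
```

The early exit mirrors the role of the counter described in the text: once the outcome of the comparison with $f$ is decided, the remaining graphs need not be examined.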

In the above process, we use a dictionary $H$ to keep track of the number of graphs in $\mathcal{D}$ found so far that contain $x'$
as a subgraph, i.e., $H[x']$ maintains $|\{D_i \mid x' \subseteq D_i\}|$ for the $D_i$ tested so far. Line 14 ensures that the isomorphism
extension test is only performed when $H[x']$ has not yet reached $f$ but is still able to reach it.

Explore super-forward neighbors (line 23 to 37). For $N^p_{fwd}$, the algorithm is similar to the procedure of exploring
super-back neighbors, except that the extension test is now on a forward edge instead of a back edge. Specifically, let
$v$ be the new node extended from $u$ (line 30). If there exists a node $w \in D_i$ that 1) has the same label as $v$; 2) is
adjacent to $m(u)$; and 3) is not part of the mapping $m$, then the isomorphism can be extended, meaning $x' \subseteq D_i$.
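
The three conditions of the forward-edge test translate directly into code. The sketch below rests on the same assumed representation as before (adjacency dictionaries, mappings as dictionaries from pattern node to data-graph node), adds a hypothetical `labels` dictionary for node labels, and ignores edge labels; the function name is illustrative only.

```python
def forward_edge_supported(mappings, adjacency, labels, u, new_label):
    """True if some mapping of x into D_i can be extended by a new node.

    The new pattern node hangs off pattern node u and carries new_label.
    """
    for m in mappings:
        used = set(m.values())  # data-graph nodes already taken by this mapping
        for w in adjacency[m[u]]:  # condition 2: w is adjacent to m(u)
            # condition 1: matching label; condition 3: w is not in the mapping
            if labels[w] == new_label and w not in used:
                return True
    return False
```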

5 Privacy and Utility Analysis



5.1 Privacy Analysis 

In this part we establish the privacy guarantee of Diff-FPM described above. We show that both the sampling and
perturbation phases preserve privacy, and then we use the composition property of differential privacy to show the privacy
guarantee of the overall algorithm.

In the sampling phase, our target probability distribution $\pi(\mathcal{D}, \cdot)$ is proportional to $\exp(\epsilon_1 u(\mathcal{D}, \cdot) / 2k\Delta u)$ for a given
dataset $\mathcal{D}$. If samples were drawn directly from this distribution, it would achieve strict $\epsilon_1/k$-differential privacy due to
the exponential mechanism. Since we use MCMC based sampling, the distribution of the samples $\hat{\pi}(\mathcal{D}, \cdot)$ will approximate
$\pi(\mathcal{D}, \cdot)$, i.e. the two distributions are asymptotically identical. In real simulation, there may be a small distance between the
two distributions. To quantify the impact on privacy when a small error is present, we use the total variation distance
[29] to measure the distance of the two distributions at a given time:

$$\|\hat{\pi}(\cdot) - \pi(\cdot)\|_{TV} = \max_{T} |\hat{\pi}(T) - \pi(T)| \qquad (3)$$

which is the largest possible difference between the probabilities that $\pi(\cdot)$ and $\hat{\pi}(\cdot)$ can assign to the same event.

Let $\mathcal{A}(\mathcal{D})$ denote the process of sampling one pattern according to Algorithm 1 (Line 4 to 10). The privacy
guarantee that $\mathcal{A}(\mathcal{D})$ offers is described by the following lemma:

Lemma 2. Let $\pi(\cdot)$ and $\hat{\pi}(\cdot)$ denote the target distribution and the distribution of samples from $\mathcal{A}(\mathcal{D})$ respectively.
Suppose $\|\hat{\pi}(\cdot) - \pi(\cdot)\|_{TV} \le \theta$; then procedure $\mathcal{A}(\mathcal{D})$ gives $(\epsilon_1/k, \delta)$-differential privacy, where $\delta = \theta(1 + e^{\epsilon_1/k})$.

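Proof. $\forall x \in \mathcal{X}$, the ratio of density at $x$ for two neighboring inputs $\mathcal{D}$ and $\mathcal{D}'$ can be bounded as

$$
\begin{aligned}
\frac{\hat{\pi}(\mathcal{D}, x)}{\hat{\pi}(\mathcal{D}', x)}
&\le \frac{\pi(\mathcal{D}, x) + \theta}{\hat{\pi}(\mathcal{D}', x)} \\
&\le \frac{\pi(\mathcal{D}', x) \cdot e^{\epsilon_1/k} + \theta}{\hat{\pi}(\mathcal{D}', x)} \\
&\le \frac{(\theta + \hat{\pi}(\mathcal{D}', x))\, e^{\epsilon_1/k} + \theta}{\hat{\pi}(\mathcal{D}', x)} \\
&= e^{\epsilon_1/k} + \frac{\theta (1 + e^{\epsilon_1/k})}{\hat{\pi}(\mathcal{D}', x)}
\end{aligned}
$$

Therefore,

$$\hat{\pi}(\mathcal{D}, x) \le e^{\epsilon_1/k}\, \hat{\pi}(\mathcal{D}', x) + \theta(1 + e^{\epsilon_1/k})$$

giving $(\epsilon_1/k,\ \theta(1 + e^{\epsilon_1/k}))$-differential privacy. □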



Note that $\theta$ is a function of simulation time $t$. The following lemma describes the asymptotic behavior and the
speed of convergence of the chain:

Lemma 3. [29] If a Markov chain on a finite state space is irreducible and aperiodic, and has a transition kernel $P$
and stationary distribution $\pi(\cdot)$, then for $x \in \mathcal{X}$,

$$\|P^t(x, \cdot) - \pi(\cdot)\|_{TV} \le M\rho^t, \quad t = 1, 2, 3, \ldots \qquad (4)$$

for some $\rho < 1$ and $M < \infty$. And

$$\lim_{t \to \infty} \|P^t(x, \cdot) - \pi(\cdot)\|_{TV} = 0 \qquad (5)$$

The lemma above means that $\theta$ decreases at least at a geometric speed and approaches zero when the simulation
runs long enough.
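
As a rough worked consequence of the geometric bound (a sketch only, since the constants $M$ and $\rho$ are generally unknown for the chain used here, and $\rho$ denotes the convergence rate rather than the proposal parameter), one can solve the bound for the number of iterations that pushes the distance below a chosen $\theta$ and then read off the corresponding $\delta$ of Lemma 2:

```latex
M\rho^{t} \le \theta
\;\Longleftrightarrow\;
t \ge \frac{\ln(M/\theta)}{\ln(1/\rho)},
\qquad
\delta = \theta\,\bigl(1 + e^{\epsilon_1/k}\bigr).
```

With the purely hypothetical values $M = 1$, $\rho = 0.9$ and $\theta = 10^{-6}$, this gives $t \ge \ln(10^{6}) / \ln(1/0.9) \approx 131.1$, i.e. 132 iterations.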

Since the sampling process in Algorithm 1 consists of $k$ successive applications of the exponential mechanism based
on random walk, we need the following well-known composition lemma to provide the privacy guarantee for the entire
sampling phase.

Lemma 4. [23] Let $A_1, \ldots, A_t$ be $t$ algorithms such that $A_i$ satisfies $\epsilon_i$-differential privacy, $1 \le i \le t$. Then their
sequential composition $\langle A_1, \ldots, A_t \rangle$ satisfies $\epsilon$-differential privacy, for $\epsilon = \sum_{i=1}^{t} \epsilon_i$.

Equipped with the results in previous lemmas, we are able to provide the privacy guarantee for Algorithm 1.

Theorem 5. Algorithm 1 satisfies $\epsilon$-differential privacy.

Proof. According to Lemma 3, when the chain has reached the steady state, $\theta$ in Lemma 2 becomes zero, giving
$\epsilon_1/k$-differential privacy in each output pattern. Using the composition lemma, the sampling phase satisfies
$\epsilon_1$-differential privacy as a whole. In the perturbation step, we add Laplace noise $Lap(k/\epsilon_2)$ independently to each of the
true supports of the $k$ patterns. Again by Lemma 4, the perturbation phase gives $\epsilon_2$-differential privacy. Therefore the
entire Algorithm 1 achieves $\epsilon$-differential privacy since $\epsilon = \epsilon_1 + \epsilon_2$. □
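
The perturbation step can be sketched as follows. The function names are hypothetical, and the Laplace variate is generated as the difference of two independent exponential variates with mean equal to the scale, which is one standard way to sample from a Laplace distribution using only the Python standard library.

```python
import random


def laplace_noise(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential variates."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb_supports(true_supports, eps2, rng=random):
    """Add independent Lap(k / eps2) noise to each of the k true supports."""
    k = len(true_supports)
    scale = k / eps2
    return [s + laplace_noise(scale, rng) for s in true_supports]
```

Each of the $k$ released counts then consumes $\epsilon_2 / k$ of the budget, so the whole perturbation phase consumes $\epsilon_2$ by sequential composition.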

5.2 Utility Analysis 

Because neighboring inputs must have similar output under differential privacy, a private algorithm usually does not
return the exact answers. In the scenario of mining top-$k$ frequent patterns, the Diff-FPM algorithm should return a
noisy list of patterns which is close to the real top-$k$ patterns. To quantify the quality of the output of Diff-FPM, we
first define two utility parameters, following [5]. Recall that $f$ is the support of the $k$th frequent pattern, and let $\beta$ be
an additive error to $f$. Given $0 < \gamma < 1$, we require that with probability at least $1 - \gamma$, (1) no pattern in the output has
true support less than $f - \beta$ and (2) all patterns having support greater than $f + \beta$ exist in the output. The following
theorems provide the utility guarantee of Diff-FPM. A score function $u(x) = |gid(x)|$ is assumed.

Theorem 6. At the end of the sampling phase in Algorithm 1, for all $0 < \gamma < 1$, with probability at least $1 - \gamma$, all
patterns in set $S$ have support greater than $f - \beta$, where $\beta = \frac{2k}{\epsilon_1}(\ln(k/\gamma) + \ln M)$ and $M$ is an upper bound on the
size of the output space.

Proof. In any of the $k$ rounds of sampling, the probability of choosing a pattern with support $f - \beta$, given that a pattern
having support $\ge f$ is still present, is at most $\frac{\exp(\epsilon_1 (f - \beta)/2k)}{\exp(\epsilon_1 f/2k)} = \exp(-\epsilon_1 \beta / 2k)$. Although the size $m$ of the output
space is unknown without enumeration, one can usually get an upper bound $M$ without considering the isomorphism
classes. Since there are at most $M$ patterns with support less than $f - \beta$, after $k$ rounds of sampling the probability is
upper bounded by $kM \exp(-\epsilon_1 \beta / 2k)$. Then

$$
\gamma \ge kM \exp(-\epsilon_1 \beta / 2k)
\;\Rightarrow\;
\beta \ge \frac{2k}{\epsilon_1} \ln(kM/\gamma)
$$

□
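
To get a feel for the bound in Theorem 6, the short sketch below evaluates $\beta$ for given parameters and checks that plugging it back into the failure probability recovers $\gamma$. The function names and the example values are illustrative assumptions, not figures from the experiments.

```python
from math import exp, log


def beta_sampling(k, eps1, gamma, M):
    """Additive error beta of Theorem 6: (2k / eps1) * (ln(k / gamma) + ln M)."""
    return (2.0 * k / eps1) * (log(k / gamma) + log(M))


def failure_bound(k, eps1, beta, M):
    """Upper bound k * M * exp(-eps1 * beta / (2k)) used in the proof."""
    return k * M * exp(-eps1 * beta / (2.0 * k))


# Hypothetical setting: k = 15 patterns, eps1 = 0.5, gamma = 0.1, M = 10**6.
example_beta = beta_sampling(15, 0.5, 0.1, 10**6)
```

Because $\beta$ grows only logarithmically in $M$ but linearly in $k/\epsilon_1$, the number of requested patterns and the budget dominate the guarantee.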

The following theorem provides the upper bound of noise added to the true support of each output pattern.

Theorem 7. For all $0 < \gamma < 1$, with probability of at least $1 - \gamma$, the noisy support of a pattern differs from its true
support by at most $\beta$, where $\beta = \frac{k}{\epsilon_2} \ln(1/\gamma)$.

Proof. This property follows directly from integrating the Laplace distribution:
$\gamma \le 2 \int_{\beta}^{\infty} \frac{\epsilon_2}{2k} \exp\left(-\frac{r \epsilon_2}{k}\right) dr = \exp\left(-\frac{\beta \epsilon_2}{k}\right)$, which transforms to $\beta \le \frac{k}{\epsilon_2} \ln(1/\gamma)$. □

6 Experimental Study 

In this section, we evaluate the performance of Diff-FPM through extensive experiments on various datasets. Since this
is the first work on differentially private mining of frequent graph patterns, the quality of the output is compared with the
result from a non-private FPM algorithm and the accuracy is reported. In addition, we demonstrate the effectiveness
of the EEN algorithm by comparing the time cost per iteration to two basic methods. We also discuss the running time
and scalability of Diff-FPM and the impact of various parameters such as privacy budget, the number of output patterns
and the size of the graph dataset. In this section we consider the scenario of mining the top-$k$ frequent patterns.

6.1 Experiment Setup



Datasets. The following three datasets are used in our experiment. DTP is a real dataset, the DTP AIDS
antiviral screening dataset³, which is frequently used in frequent graph pattern mining studies. It contains 1084 graphs,
with an average graph size of 45 edges and 43 vertices. There are 14 unique node labels and all edges are considered
to have the same label.

The click dataset consists of 20K small tree graphs (4 nodes and 3 edges on average) obtained by a graph generator
developed by Zaki [34]. To a certain extent, this synthetic dataset simulates user click graphs from web server logs
[34], which is a suitable type of data requiring privacy-preserving mining. All the tree graphs in this dataset are
sampled from a master tree. In our experiment the master tree has 10,000 nodes with a depth of 10 and a fanout of 6.

The above two datasets contain graphs that are relatively sparse. To test our algorithm on dense graphs, we also
use a dataset (referred to as dense) containing 5K graphs, in which the average node degree is 7. Each graph contains
10 vertices and 35 edges on average. The graph generator [6] we use is specially designed for generating graph datasets
for evaluation of frequent subgraph mining algorithms. The size of this graph dataset is comparable to the largest
datasets used in previous works [33], [18].

Utility metrics. We evaluate the quality of the output of Diff-FPM by employing the following two utility metrics:

• Precision. Precision is defined as the fraction of identified top-$k$ graph patterns that are in the actual top-$k$, i.e.,

$$\text{Precision} = \frac{|\text{True Positives}|}{k}$$

This is the complementary measure of the false negative rate used in 0. 

• Support Accuracy. The measure of precision reflects the percentage of desired/undesired patterns in the output,
yet it cannot indicate how good or bad the output patterns are in terms of their supports. For example, if
$f = 1000$, it is much more undesirable if a pattern with support 10 appears in the output compared to a pattern
with support 980, even though the precision may be the same in these two cases. We first define the relative
support error (RSE) as

$$RSE = \frac{(S_{true} - S_{out})/k}{f}$$

where $S_{true}$ and $S_{out}$ are the sum of the supports of the real top-$k$ patterns and the sum of the supports of the
sampled patterns respectively. This measure reflects the average deviation of an output pattern's support with
respect to the support threshold $f$. In the plots, the support accuracy is reported, which equals $1 - RSE$.
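
Both metrics are straightforward to compute once the true top-$k$ patterns and their supports are known. The sketch below uses hypothetical function names and assumes that the relative support error is the average support gap per pattern divided by the threshold $f$, which is our reading of the definition above.

```python
def precision(output_patterns, true_top_k):
    """Fraction of the true top-k patterns that appear in the output."""
    k = len(true_top_k)
    return len(set(output_patterns) & set(true_top_k)) / k


def support_accuracy(true_supports, output_supports, f):
    """1 - RSE, with RSE = ((S_true - S_out) / k) / f (assumed form)."""
    k = len(true_supports)
    rse = ((sum(true_supports) - sum(output_supports)) / k) / f
    return 1.0 - rse
```

In the example from the text, replacing a pattern of support 1000 by one of support 980 barely changes the support accuracy, whereas replacing it by one of support 10 lowers it substantially, even though precision is the same in both cases.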

All experiments were conducted on a PC with a 3.40GHz CPU and 8GB RAM. The random walk in the Diff-FPM
algorithm consumes only a small amount of memory due to its Markovian nature, i.e., earlier states in the walk do not
need to be remembered. We can, however, allocate extra memory to cache some of the patterns and their neighbors.
We implemented our algorithm in Python 2.7 with the JIT compiler PyPy⁴ to speed it up. The default parameters of
$\epsilon = 0.5$ and $k = 15$ were used unless specified otherwise. In the experiment we do not release the noisy supports of
the patterns in the output (line 9 in Algorithm 1), so all the privacy budget is used in the sampling phase.



6.2 Experiment Results 

Comparison of neighbor exploration methods. In Section 4 we proposed the EEN algorithm to efficiently explore
the neighborhood of a pattern. We now compare it with two other methods: a naive approach which finds the support
of each neighbor of the current pattern $x$ and a basic approach which uses the monotonic property of frequent patterns
(see Section 4.2). Figure 2 shows the average iteration time, on a logarithmic scale, of the three methods over three
datasets. In each iteration, a neighboring pattern is proposed and then accepted or rejected according to the MH
algorithm. Clearly, EEN takes significantly less time in each iteration than the other methods in both datasets, reducing
the iteration time by at least an order of magnitude compared to the naive approach. Thus all subsequent results are
presented with EEN enabled.

Figure 3: Impact of graph dataset size



Run time and scalability. Figure 3(c) illustrates the average time taken to output one frequent pattern as the size of
the dataset increases. For the full datasets, click takes 20 seconds, DTP takes about 1 minute and dense sits in the
middle, although the click dataset contains 20K graphs compared to only 1K in DTP. This indicates that the size of
each individual graph and the size of the neighborhood have a larger impact on the run time than the total number of
graphs in the dataset (note that DTP has 14 labels and thus a larger neighborhood of a pattern compared to dense). For
scalability, all datasets are observed to scale linearly in time as the size of the graph dataset increases.

Precision and support accuracy. We now examine the quality of the output by studying the precision and support 
accuracy (SA) of the Diff-FPM algorithm under various parameter settings. 

First, Figure 3(a) and Figure 3(b) show the precision and SA when we increase the size of the graph dataset from
10% to 100%⁵. An increasing trend of the output quality can be clearly observed here. This is in line with our
expectation because achieving differential privacy is more demanding in a small dataset: the larger the number of
records in the database, the easier it is to hide an individual record's impact on the output. For all three full datasets,
Diff-FPM is able to achieve at least 80% on both precision and SA.



Figure 5(a) shows the precision when varying the privacy budget $\epsilon$. With a very limited budget ($\epsilon = 0.1$), only about
30% of samples are from the real top-$k$ patterns for DTP and dense. This is inevitable due to the privacy-utility
tradeoff. As more privacy budget is given, the precision of Diff-FPM increases fast. At $\epsilon = 0.5$, the precision on
all datasets has reached 80%. Further increase in privacy budget does not provide significant benefit to the precision.
We observed a similar trend in the support accuracy plot (Figure 5(b)), with less dramatic changes for $\epsilon$ from 0.1 to
0.5.

Figures 6(a) and 6(b) illustrate the impact of the number of patterns in the output. Recall that in each round of
sampling, a budget of $\epsilon/k$ is consumed (cf. proof of Theorem 5). Given a certain privacy budget, the more patterns
to output, the less privacy budget each sample can use. Thus we expect the average quality of the output to drop
as $k$ increases, which is confirmed in the result. Meanwhile, the support accuracy of the output holds well with the
increasing number of outputs, which can be seen in Figure 6(b).

Figure 5: Precision and accuracy versus $\epsilon$

3 http://dtp.nci.nih.gov/docs/aids/aids_data.html
4 http://pypy.org
5 The data point for dense at 10% is absent since the smallest dataset size that can be generated is 1K.



Score function. In Section 3.1 we discussed the principles of designing the score function. Here we experimentally
compare several basic choices on the synthetic dataset. Figures 4(a) and 4(b) show the precision and support accuracy
of two score functions, linear and plateau. linear represents the most straightforward choice: $u(x) = |gid(x)|$ for any
pattern $x$, with $\Delta u = 1$. plateau treats all the patterns in $\{x : |gid(x)| \ge f\}$ the same, i.e., $u(x) = f$ if $|gid(x)| \ge f$;
$u(x) = |gid(x)|$ if $|gid(x)| < f$. The random walk with the plateau score function is able to traverse more patterns
in the POFG. However, as shown in the plots, this does not lead to better precision and support accuracy in the
result. Over the range of different graph dataset sizes, the linear score function consistently performs better due to the
exponentially amplified probability mass for more frequent patterns. Therefore we use the linear score function for
the rest of the experiment.
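
The difference between the two score functions is easy to see on a toy example. The following sketch uses hypothetical function names and assumes exponential-mechanism weights proportional to $\exp(\epsilon_1 u(x) / 2k\Delta u)$ over a small, explicitly enumerated set of supports, which is only feasible for illustration.

```python
from math import exp


def linear_score(support, f):
    """u(x) = |gid(x)|."""
    return support


def plateau_score(support, f):
    """u(x) = f for frequent patterns, |gid(x)| otherwise."""
    return min(support, f)


def target_weights(supports, score, f, eps1, k, delta_u=1.0):
    """Normalized weights proportional to exp(eps1 * u / (2 * k * delta_u))."""
    scores = [score(s, f) for s in supports]
    top = max(scores)  # subtract the maximum for numerical stability
    raw = [exp(eps1 * (u - top) / (2.0 * k * delta_u)) for u in scores]
    total = sum(raw)
    return [w / total for w in raw]
```

With supports 1200, 1000 and 500 and $f = 1000$, the plateau score gives the first two patterns equal mass, while the linear score concentrates most of the mass on the pattern with support 1200.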



Impact of proposal distribution. Recall that two parameters have an impact on our proposal distribution (Section 3.2.4):
$\eta$ balances the weight on frequent/infrequent neighbors and $\rho$ balances the weight on sub-neighbors/super-neighbors
within the frequent neighbors. Note that the proposal distribution does not affect the correctness of the MH sampling,
but it does affect the speed of convergence. Here the impact of $\eta$ is measured by the average accept rate in the entire
walk, i.e., the rate at which a proposed pattern is accepted on average. Since frequent patterns have exponentially large
probability mass to be sampled, a larger value of $\eta$ should be desired. This is reflected in Figure 7(a), in which the
average accept rate increases from about 35% when $\eta = 0.4$ to more than 60% at $\eta = 0.9$. The other parameter $\rho$
controls the probability mass of sub-neighbors given that a frequent pattern will be proposed. In graph pattern mining,
the smaller graphs usually have larger support. Therefore a $\rho$ of at least 0.5 is preferred, which can be seen by the drop
of average accept rate from 60% to less than 40% when $\rho$ decreases from 0.5 to 0.4 in Figure 7(b). Interestingly, as
we deviate away from 0.5, the acceptance rate slowly drops and adversely affects the sampling performance. This is
because a balanced sub-neighbor/super-neighbor proposal allows faster transition from one pattern to another, making
the chain well mixed instead of lingering in a local region.

Convergence analysis. A decision we have to make is when to stop the random walk and output a sample. In
Section 3.3 we introduced the Z-score based Geweke diagnostic, which compares the distribution at the beginning and
end of the chain. Since MCMC is typically used to estimate a function of the underlying random variable instead of
structural data like graphs, we need to choose some properties of the patterns which we will monitor using the Geweke
test. The three metrics we use in the experiment are the number of neighbors $|N(x)|$, the number of frequent neighbors
$|N_1(x)|$ and the number of nodes in the pattern $|x|$. Figure 8 shows the convergence traces of a sample run with K = 20
and $\epsilon = 0.5$ on the DTP dataset. Each curve corresponds to the Z-score of a chain over the number of iterations.
It can be seen that the Markov chain we design converges quite fast thanks to the tuning of the proposal
distribution. For each chain, convergence is declared when the Z-scores of all three metrics have fallen within the
$[-1, 1]$ range for 20 iterations continuously. In Figure 8 this happens around 150 iterations for most chains.
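
The stopping rule described above can be written as a small predicate. This is a sketch with assumed names and data layout: `z_history` holds one tuple of Z-scores per iteration, one entry per monitored metric.

```python
def converged(z_history, window=20, bound=1.0):
    """True if every metric's Z-score stayed in [-bound, bound] for the
    last `window` consecutive iterations."""
    if len(z_history) < window:
        return False
    return all(abs(z) <= bound for zs in z_history[-window:] for z in zs)
```

The random walk would call this check after each iteration and output the current pattern as a sample once it returns true.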



7 Related Work 

In a broad sense, our paper belongs to the general problem of privacy-preserving data mining - a topic that has been 
studied extensively for a decade because of its numerous applications to a wide variety of problems in the literature. 
A general overview of various research works on this topic can be found in (TJ. Below we briefly review the results 
relevant to this paper. 

Data Mining with Differential Privacy. Ever since differential privacy ifTTIl was proposed and embraced by the
database community, the privacy requirement that various works try to achieve has shifted from syntactic models like
$k$-anonymity BUI to the more rigorous model of differential privacy. A formal introduction to differential privacy can
be found in Section 2.2. There exist two basic approaches to differentially private data mining. In the first approach,
the data owner releases an anonymized version of the original dataset under differential privacy, and the user has
the freedom of conducting any data mining task on the anonymized dataset. We call this the 'publishing model'.
Examples include releasing anonymized versions of contingency tables [4], [32], data cubes [9] and spatial data H).
The general idea in these works is to release tables of noisy counts (histograms) and study how to ensure they are
sufficiently accurate for different query workloads. In the other approach, differential privacy is applied to a specific
data mining task, such as decision tree induction [12], social recommendations [22] and frequent itemset mining [5].
The problem addressed in this paper falls into this category. In these works, randomness is often injected into the
intermediate results or sub-procedures of a mining algorithm. While the output of the first approach is more versatile,
the second approach often leads to better utility (for specific data mining tasks) since privacy-preserving techniques
are particularly designed for that data mining algorithm.

Privacy-Protection of Graphs. The aforementioned works on differentially private data mining all deal with
structured data (tables or set-valued data). For graph data, there are research efforts HI to anonymize a social network
graph to prevent node and edge re-identification. But most of them focus on modifying the graph structure to
satisfy $k$-anonymity, which has been proved to be insufficient [ 1 1 . Recently, several works [19], [17] have emerged to provide
differentially private analysis of graph data, which release some statistics such as the number of triangles about a
single (large) graph. Two types of differential privacy have been introduced to handle graph data: node differential
privacy and edge differential privacy. It is still open whether any nontrivial graph statistics can be released under node
differential privacy due to its inherent large sensitivity (e.g., removing a node in a star graph may result in an empty
graph). Hay et al. ifTTI consider the problem of releasing the degree distribution of a graph under a variant of edge
differential privacy. More recently, Karwa et al. [19] propose algorithms to output approximate answers to subgraph
counting queries, i.e., given a query graph $H$, returning the number of edge-induced isomorphic copies of $H$ in the
input graph. The technique they use is to calibrate noise according to the smooth sensitivity E71l of $H$ in the input
graph. The cases when $H$ is a triangle, $k$-star or $k$-triangle are studied in [19]. Unfortunately, their work
does not yet support the case when $H$ is an arbitrary subgraph.

In contrast, we have a different problem setting from [19] in this paper. First, like J3], our privacy-preserving
algorithm is associated with a specific and more complicated data mining task. Second, we consider a graph database
containing a collection of graphs related to individuals. The only work we can find on privacy protection for a graph
database is [20], which follows the 'publishing model'. Their goal is to achieve $k$-anonymity by first constructing a
set of super-structures and then generating synthetic representations from them.

Graph Pattern Mining. Finally, we briefly discuss relevant works on traditional non-private graph pattern mining.
A more comprehensive survey can be found in 13. Earlier works which aim at finding all the frequent patterns in
a graph database usually explore the search space in a certain manner. Representative approaches include a
priori-based (e.g. [18]) and pattern growth based (e.g. gSpan [33]). An issue with this direction is that the search space
grows exponentially with the pattern size, which may reach a computation bottleneck. Thus later works aim at mining
significant or representative patterns with scalability. One way of achieving this is through random walk [3], which
also motivates our use of MCMC sampling for privacy-preserving purposes. Another remotely related work is RD ,
which connects probabilistic inference and differential privacy. It differs from this work by focusing on inference
on the output of a differentially private algorithm.

8 Concluding Remarks 

We have presented a novel technique for differentially private mining of frequent graph patterns. The proposed so- 
lution integrates the process of graph mining and privacy protection into an MCMC sampling framework. We have 
explored the design space of the proposal distribution and the score function and their impact on the performance 
of the algorithm. Moreover, we have established the theoretical privacy and utility guarantee of our algorithm. An 
efficient algorithm for counting the neighbors of a pattern has been proposed to greatly reduce the time-consuming 
subgraph isomorphism tests. 

Experiments on both synthetic and real datasets show that with a moderate amount of privacy budget, Diff-FPM is 
able to output frequent patterns with over 80% precision and support accuracy. We also observe that utility drops as 
the number of outputs increases or the dataset size decreases, which is inevitable under the requirement of 
differential privacy. 
Algorithm 2: The EEN algorithm 

input : Pattern x, graph dataset D, support threshold f 
output: N_b(x), N_1(x), N_2(x) 

 1 Initialize N_b, N_1, N_2 ← ∅ (x omitted for brevity); 
 2 Find membership bitmap B_x using VF2 isomorphism test; 
 3 Populate sub-neighbors N_b, super-back neighbors N_p^back, super-forward neighbors N_p^fwd; 
   /* Explore sub-neighbors N_b */ 
 4 if sum(B_x) ≥ f then N_1 ← N_1 ∪ N_b; 
 5 else for x' ∈ N_b do 
 6     if sub_is_freq(x', B_x) then N_1 ← N_1 ∪ {x'}; 
 7     else N_2 ← N_2 ∪ {x'}; 
   /* Explore super-back neighbors N_p^back */ 
 8 if sum(B_x) < f then N_2 ← N_2 ∪ N_p^back; 
 9 else 
10     ∀x' ∈ N_p^back, initialize dictionary H[x'] = 0; 
11     for i ← 1 to |D| do 
12         Find set M of all mappings between D_i and x; 
13         for x' ∈ N_p^back do 
14             if H[x'] < f and |D| - i + H[x'] ≥ f then 
15                 Let (u, v) be the back edge, i.e., x' = x ∘ (u, v); 
16                 for m ∈ M do 
17                     if m(u), m(v) are adjacent in D_i then 
18                         H[x'] ← H[x'] + 1; 
19                         break; 
20     for x' ∈ N_p^back do 
21         if H[x'] ≥ f then N_1 ← N_1 ∪ {x'}; 
22         else N_2 ← N_2 ∪ {x'}; 
   /* Explore super-forward neighbors N_p^fwd */ 
23 if sum(B_x) < f then N_2 ← N_2 ∪ N_p^fwd; 
24 else 
25     ∀x' ∈ N_p^fwd, initialize dictionary H[x'] = 0; 
26     for i ← 1 to |D| do 
27         Find set M of all mappings between D_i and x; 
28         for x' ∈ N_p^fwd do 
29             if H[x'] < f and |D| - i + H[x'] ≥ f then 
30                 Let (u, v) be the forward edge, i.e., x' = x ∘ (u, v) and v ∈ x', v ∉ x; 
31                 for m ∈ M do 
32                     if ∃w ∈ V_{D_i} s.t. (w, m(u)) ∈ E_{D_i}, l(w) = l(v), w ∉ m(V_x) then 
33                         H[x'] ← H[x'] + 1; 
34                         break; 
35     for x' ∈ N_p^fwd do 
36         if H[x'] ≥ f then N_1 ← N_1 ∪ {x'}; 
37         else N_2 ← N_2 ∪ {x'}; 
38 return N_b, N_1, N_2; 

40 function sub_is_freq(x', B_x) 
41     B ← ∩_{e ∈ x'} B_e; 
42     C ← {i ∈ B \ B_x : x' ⊆ D_i, D_i ∈ D}; 
43     if |B_x| + |C| ≥ f then return true; 
44     else return false; 
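
The two pruning ideas in Algorithm 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a toy representation in which a graph is a set of edges over fixed vertex IDs, so that pattern containment reduces to an edge-subset check; the helper `edge_subset` is a hypothetical stand-in for the VF2-based mapping test. The names `classify_super_neighbors` and `sub_is_frequent` are likewise illustrative. The pruning bound in this sketch counts the graph currently being examined as still available, which is an assumption of the sketch.

```python
from typing import Callable, FrozenSet, List, Sequence, Set, Tuple

Edge = Tuple[int, int]
Graph = FrozenSet[Edge]  # toy graph: a set of edges over fixed vertex IDs
Contains = Callable[[Graph, Graph], bool]


def edge_subset(graph: Graph, pattern: Graph) -> bool:
    """Toy stand-in for a subgraph isomorphism test (hypothetical helper)."""
    return pattern <= graph


def classify_super_neighbors(
    candidates: Sequence[Graph],
    database: Sequence[Graph],
    f: int,
    contains: Contains = edge_subset,
) -> Tuple[List[Graph], List[Graph], int]:
    """Split candidates into frequent / infrequent, skipping decided tests."""
    hits = {c: 0 for c in candidates}
    tests = 0
    n = len(database)
    for i, graph in enumerate(database):
        remaining = n - i  # graphs not yet examined, including this one
        for c in candidates:
            # Skip once the outcome is decided: already frequent,
            # or unable to reach f even if every remaining graph matches.
            if hits[c] >= f or hits[c] + remaining < f:
                continue
            tests += 1
            if contains(graph, c):
                hits[c] += 1
    frequent = [c for c in candidates if hits[c] >= f]
    infrequent = [c for c in candidates if hits[c] < f]
    return frequent, infrequent, tests


def sub_is_frequent(
    sub_pattern: Graph,
    gid_x: Set[int],
    database: Sequence[Graph],
    f: int,
    contains: Contains = edge_subset,
) -> bool:
    """Every graph containing x also contains its sub-pattern, so only
    graphs outside gid_x need an explicit containment test."""
    count = len(gid_x)
    if count >= f:
        return True
    for i, graph in enumerate(database):
        if i not in gid_x and contains(graph, sub_pattern):
            count += 1
            if count >= f:
                return True
    return False
```

On a four-graph toy database with f = 3 and two candidates, the first function performs 6 containment tests instead of the 8 a naive scan would need. The saving grows with the number of candidates whose status is settled early.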